- PCRE(3) UNIX System V PCRE(3)
-
-
-
- NAME
- pcre - Perl-compatible regular expressions.
-
- SYNOPSIS
- #include <pcre.h>
-
- pcre *pcre_compile(const char *pattern, int options,
-     const char **errptr, int *erroffset);
-
- pcre_extra *pcre_study(const pcre *code, int options,
-     const char **errptr);
-
- int pcre_exec(const pcre *code, const pcre_extra *extra,
-     const char *subject, int length, int options,
-     int *ovector, int ovecsize);
-
- int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
-
- char *pcre_version(void);
-
- void *(*pcre_malloc)(size_t);
-
- void (*pcre_free)(void *);
-
- unsigned char *pcre_cbits[128];
-
- unsigned char *pcre_ctypes[256];
-
- unsigned char *pcre_fcc[256];
-
- unsigned char *pcre_lcc[256];
-
-
-
-
- DESCRIPTION
- The PCRE library is a set of functions that implement
- regular expression pattern matching using the same syntax
- and semantics as Perl 5, with just a few differences (see
- below). The current implementation corresponds to Perl
- 5.004.
-
- PCRE has its own native API, which is described in this man
- page. There is also a set of wrapper functions that
- correspond to the POSIX API. See pcreposix(3).
-
- The three functions pcre_compile(), pcre_study(), and
- pcre_exec() are used for compiling and matching regular
- expressions. The function pcre_info() is used to find out
- information about a compiled pattern, while the function
- pcre_version() returns a pointer to a string containing the
- version of PCRE and its date of release.
-
-
-
- Page 1 (printed 12/10/98)
-
- The global variables pcre_malloc and pcre_free initially
- contain the entry points of the standard malloc() and free()
- functions respectively. PCRE calls the memory management
- functions via these variables, so a calling program can
- replace them if it wishes to intercept the calls. This
- should be done before calling any PCRE functions.
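-
- The replacement mechanism described above can be sketched
- without the library. In the following, demo_malloc and
- demo_free are hypothetical stand-ins for the pcre_malloc and
- pcre_free globals, and the counting functions are invented
- for illustration; with the real library the assignments would
- be made to the two PCRE variables before the first PCRE call.
-
```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-ins for the library globals pcre_malloc and pcre_free.
   They exist only so that this sketch compiles on its own. */
static void *(*demo_malloc)(size_t) = malloc;
static void (*demo_free)(void *) = free;

/* Number of blocks handed out and not yet released. */
static size_t live_blocks = 0;

/* Replacement allocator: forwards to malloc() and counts successes. */
void *counting_malloc(size_t size)
{
    void *block = malloc(size);
    if (block != NULL)
        live_blocks++;
    return block;
}

/* Replacement deallocator: forwards to free() and updates the count. */
void counting_free(void *block)
{
    if (block != NULL) {
        assert(live_blocks > 0);
        live_blocks--;
    }
    free(block);
}

/* Install both hooks together so that every block obtained through
   the first pointer is released through the matching second one. */
void install_counting_hooks(void)
{
    demo_malloc = counting_malloc;
    demo_free = counting_free;
}

size_t live_block_count(void)
{
    return live_blocks;
}
```
-
- After install_counting_hooks() has been called, any caller
- that allocates through the pointer variables reaches the
- counting functions, which is how a program could watch for
- blocks that are never released.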
-
- The other global variables are character tables. They are
- initialized when PCRE is compiled, from source that is
- generated by reference to the C character type functions,
- but which the maintainer of PCRE is free to modify. In
- principle they could also be modified at runtime. See PCRE's
- README file for more details.
-
-
-
- MULTI-THREADING
- The PCRE functions can be used in multi-threading
- applications, with the proviso that the character tables and
- the memory management functions pointed to by pcre_malloc
- and pcre_free will be shared by all threads.
-
- The compiled form of a regular expression is not altered
- during matching, so the same compiled pattern can safely be
- used by several threads at once.
-
-
-
- COMPILING A PATTERN
- The function pcre_compile() is called to compile a pattern
- into an internal form. The pattern is a C string terminated
- by a binary zero, and is passed in the argument pattern. A
- pointer to the compiled code block is returned. The pcre
- type is defined for this for convenience, but in fact pcre
- is just a typedef for void, since the contents of the block
- are not defined.
-
- The size of a compiled pattern is roughly proportional to
- the length of the pattern string, except that each character
- class (other than those containing just a single character,
- negated or not) requires 33 bytes, and repeat quantifiers
- with a minimum greater than one or a bounded maximum cause
- the relevant portions of the compiled pattern to be
- replicated.
-
- The options argument contains independent bits that affect
- the compilation. It should be zero if no options are
- required. Those options that are compatible with Perl can
- also be set at compile time from within the pattern (see the
- detailed description of regular expressions below) and all
- options except PCRE_EXTENDED, PCRE_EXTRA and PCRE_UNGREEDY
- can be set at the time of matching.
-
- If errptr is NULL, pcre_compile() returns NULL immediately.
- Otherwise, if compilation of a pattern fails, pcre_compile()
- returns NULL, and sets the variable pointed to by errptr to
- point to a textual error message.
-
- The offset from the start of the pattern to the character
- where the error was discovered is placed in the variable
- pointed to by erroffset, which must not be NULL. If it is,
- an immediate error is given.
-
- The following option bits are defined in the header file:
-
- PCRE_ANCHORED
-
- If this bit is set, the pattern is forced to be "anchored",
- that is, it is constrained to match only at the start of the
- string which is being searched (the "subject string"). This
- effect can also be achieved by appropriate constructs in the
- pattern itself, which is the only way to do it in Perl.
-
- PCRE_CASELESS
-
- If this bit is set, letters in the pattern match both upper
- and lower case letters in any subject string. It is
- equivalent to Perl's /i option.
-
- PCRE_DOLLAR_ENDONLY
-
- If this bit is set, a dollar metacharacter in the pattern
- matches only at the end of the subject string. By default,
- it also matches immediately before the final character if it
- is a newline (but not before any other newlines). The
- PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is
- set. There is no equivalent to this option in Perl.
-
- PCRE_DOTALL
-
- If this bit is set, a dot metacharacter in the pattern
- matches all characters, including newlines. By default,
- newlines are excluded. This option is equivalent to Perl's
- /s option. A negative class such as [^a] always matches a
- newline character, independent of the setting of this
- option.
-
- PCRE_EXTENDED
-
- If this bit is set, whitespace characters in the pattern are
- totally ignored except when escaped or inside a character
- class, and characters between an unescaped # outside a
- character class and the next newline character, inclusive,
- are also ignored. This is equivalent to Perl's /x option,
- and makes it possible to include comments inside complicated
- patterns.
-
- PCRE_MULTILINE
-
- By default, PCRE treats the subject string as consisting of
- a single "line" of characters (even if it actually contains
- several newlines). The "start of line" metacharacter (^)
- matches only at the start of the string, while the "end of
- line" metacharacter ($) matches only at the end of the
- string, or before a terminating newline. This is the same as
- Perl.
-
- When PCRE_MULTILINE is set, the "start of line" and "end
- of line" constructs match immediately following or
- immediately before any newline in the subject string,
- respectively, as well as at the very start and end. This is
- equivalent to Perl's /m option. If there are no "\n"
- characters in a subject string, or no occurrences of ^ or $
- in a pattern, setting PCRE_MULTILINE has no effect.
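-
- The positions at which ^ and $ may match, as described for
- PCRE_DOLLAR_ENDONLY and PCRE_MULTILINE, can be sketched as
- two predicates. This follows only the wording of this page
- and is not the library's code, so edge cases may differ. The
- DEMO_ flag names and values are hypothetical; the real option
- bits are defined in pcre.h.
-
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values for this sketch only. */
enum {
    DEMO_MULTILINE      = 1,
    DEMO_DOLLAR_ENDONLY = 2
};

/* Non-zero if "^" may match at offset pos (0..length). */
int circumflex_matches(const char *subject, size_t length,
                       size_t pos, int flags)
{
    assert(subject != NULL && pos <= length);
    if (pos == 0)
        return 1;                       /* very start of the subject */
    /* In multiline mode, also immediately after any newline. */
    return (flags & DEMO_MULTILINE) != 0 && subject[pos - 1] == '\n';
}

/* Non-zero if "$" may match at offset pos (0..length). */
int dollar_matches(const char *subject, size_t length,
                   size_t pos, int flags)
{
    assert(subject != NULL && pos <= length);
    if (pos == length)
        return 1;                       /* very end of the subject */
    if (flags & DEMO_MULTILINE)
        return subject[pos] == '\n';    /* before any newline */
    if (flags & DEMO_DOLLAR_ENDONLY)
        return 0;                       /* end of subject only */
    /* Default: also just before a newline that ends the subject. */
    return pos == length - 1 && subject[pos] == '\n';
}
```
-
- The multiline test comes first in dollar_matches() because
- PCRE_DOLLAR_ENDONLY is ignored when PCRE_MULTILINE is set.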
-
- PCRE_EXTRA
-
- This option turns on additional functionality of PCRE that
- is incompatible with Perl. Any backslash in a pattern that
- is followed by a letter that has no special meaning causes
- an error, thus reserving these combinations for future
- expansion. By default, as in Perl, a backslash followed by a
- letter with no special meaning is treated as a literal.
- There are two extra features currently provided, and both
- are in some sense experimental additions that are useful for
- influencing the progress of a match.
-
- (1) The sequence \X inserts a Prolog-like "cut" into the
- expression.
-
- (2) Once a subpattern enclosed in (?>subpat) brackets has
- matched, backtracking never goes back into the pattern.
-
- See below for further details of both of these. PCRE_EXTRA
- can be set by a (?X) option setting within the pattern, but
- this must precede anything in the pattern which relies on
- its being set.
-
- PCRE_UNGREEDY
-
- This option inverts the "greediness" of the quantifiers so
- that they are not greedy by default, but become greedy if
- followed by "?". It is not compatible with Perl. It can also
- be set by a (?U) option setting within the pattern.
-
-
-
- STUDYING A PATTERN
- When a pattern is going to be used several times, it is
- worth spending more time analyzing it in order to speed up
- the time taken for matching. The function pcre_study() takes
- a pointer to a compiled pattern as its first argument, and
- returns a pointer to a pcre_extra block (another void
- typedef) containing additional information about the
- pattern; this can be passed to pcre_exec(). If no additional
- information is available, NULL is returned.
-
- The second argument contains option bits. The only one
- currently supported is PCRE_CASELESS. It forces the studying
- to be done in a caseless manner, even if the original
- pattern was compiled without PCRE_CASELESS. When the result
- of pcre_study() is passed to pcre_exec(), it is used only if
- its caseless state is the same as that of the matching
- process. A pattern that is compiled without PCRE_CASELESS
- can be studied with and without PCRE_CASELESS, and the
- appropriate data passed to pcre_exec() with and without the
- PCRE_CASELESS flag.
-
- The third argument for pcre_study() is a pointer to an error
- message. If studying succeeds (even if no data is returned),
- the variable it points to is set to NULL. Otherwise it
- points to a textual error message.
-
- At present, studying a pattern is useful only for non-
- anchored patterns that do not have a single fixed starting
- character. A bitmap of possible starting characters is
- created.
-
-
-
- MATCHING A PATTERN
- The function pcre_exec() is called to match a subject string
- against a pre-compiled pattern, which is passed in the code
- argument. If the pattern has been studied, the result of the
- study should be passed in the extra argument. Otherwise this
- must be NULL.
-
- The subject string is passed as a pointer in subject and a
- length in length. Unlike the pattern string, it may contain
- binary zero characters.
-
- The options PCRE_ANCHORED, PCRE_CASELESS,
- PCRE_DOLLAR_ENDONLY, PCRE_DOTALL, and PCRE_MULTILINE can be
- passed in the options argument, whose unused bits must be
- zero. However, if a pattern is compiled with any of these
- options, they cannot be unset when it is obeyed.
-
- There are also two further options that can be set only at
- matching time:
-
- PCRE_NOTBOL
-
- The first character of the string is not the beginning of a
- line, so the circumflex metacharacter should not match
- before it. Setting this without PCRE_MULTILINE (at either
- compile or match time) causes circumflex never to match.
-
- PCRE_NOTEOL
-
- The end of the string is not the end of a line, so the
- dollar metacharacter should not match it. Setting this
- without PCRE_MULTILINE (at either compile or match time)
- causes dollar never to match.
-
- In general, a pattern matches a certain portion of the
- subject, and in addition, further substrings from the
- subject may be picked out by parts of the pattern. Following
- the usage in Jeffrey Friedl's book, this is called
- "capturing" in what follows, and the phrase "capturing
- subpattern" is used for a fragment of a pattern that picks
- out a substring. PCRE supports several other kinds of
- parenthesized subpattern that do not cause substrings to be
- captured.
-
- Captured substrings are returned to the caller via a vector
- of integer offsets whose address is passed in ovector. The
- number of elements in the vector is passed in ovecsize. This
- should always be an even number, because the elements are
- used in pairs. If an odd number is passed, it is rounded
- down.
-
- The first element of a pair is set to the offset of the
- first character in a substring, and the second is set to the
- offset of the first character after the end of a substring.
- The first pair, ovector[0] and ovector[1], identify the
- portion of the subject string matched by the entire pattern.
- The next pair is used for the first capturing subpattern,
- and so on. The value returned by pcre_exec() is the number
- of pairs that have been set. If there are no capturing
- subpatterns, the return value from a successful match is 1,
- indicating that just the first pair of offsets has been set.
-
- It is possible for a capturing subpattern number n+1 to
- match some part of the subject when subpattern n has not
- been used at all. For example, if the string "abc" is
- matched against the pattern "(a|(z))(bc)", subpatterns 1 and
- 3 are matched, but 2 is not. When this happens, both offset
- values corresponding to the unused subpattern are set to -1.
-
- If a capturing subpattern is matched repeatedly, it is the
- last portion of the string that it matched that gets
- returned.
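-
- The pair layout described above can be read back with a
- small helper. This is a sketch that works on an offsets
- vector the caller has already obtained; copy_capture() and
- its parameter names are invented for illustration and are
- not part of the PCRE API.
-
```c
#include <assert.h>
#include <string.h>

/* Copy captured substring number n (0 means the whole match) into buf
   as a NUL-terminated string. rc is the positive value returned by the
   match, that is, the number of pairs that were set.
   Returns the substring length, or -1 if n is out of range, the pair
   is unset (-1, -1), or buf is too small. */
int copy_capture(const char *subject, const int *ovector, int rc,
                 int n, char *buf, size_t bufsize)
{
    assert(subject != NULL && ovector != NULL && buf != NULL);
    if (n < 0 || n >= rc)
        return -1;

    int start = ovector[2 * n];        /* offset of first character */
    int end = ovector[2 * n + 1];      /* offset just past the last one */
    if (start < 0 || end < start)
        return -1;                     /* unused subpattern */

    size_t len = (size_t)(end - start);
    if (len + 1 > bufsize)
        return -1;

    /* memcpy, not strcpy: the subject may contain binary zeros. */
    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return (int)len;
}
```
-
- For the example above, "abc" matched against "(a|(z))(bc)",
- the vector would hold the pairs (0,3), (0,1), (-1,-1) and
- (1,3) with a return value of 4, so the helper yields "abc",
- "a", a failure for subpattern 2, and "bc".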
-
- If the vector is too small to hold all the captured
- substrings, it is used as far as possible, and the function
- returns a value of zero. In particular, if the substring
- offsets are not of interest, pcre_exec() may be called with
- ovector passed as NULL and ovecsize as zero. However, if the
- pattern contains back references and the ovector isn't big
- enough to remember the related substrings, PCRE has to get
- additional memory for use during matching. Thus it is
- usually advisable to supply an ovector.
-
- Note that pcre_info() can be used to find out how many
- capturing subpatterns there are in a compiled pattern.
-
- If pcre_exec() fails, it returns a negative number. The
- following are defined in the header file:
-
- PCRE_ERROR_NOMATCH (-1)
-
- The subject string did not match the pattern.
-
- PCRE_ERROR_BADREF (-2)
-
- There was a back-reference in the pattern to a capturing
- subpattern that had not previously been set.
-
- PCRE_ERROR_NULL (-3)
-
- Either code or subject was passed as NULL, or ovector was
- NULL and ovecsize was not zero.
-
- PCRE_ERROR_BADOPTION (-4)
-
- An unrecognized bit was set in the options argument.
-
- PCRE_ERROR_BADMAGIC (-5)
-
- PCRE stores a 4-byte "magic number" at the start of the
- compiled code, to catch the case when it is passed a junk
- pointer. This is the error it gives when the magic number
- isn't present.
-
- PCRE_ERROR_UNKNOWN_NODE (-6)
-
- While running the pattern match, an unknown item was
- encountered in the compiled pattern. This error could be
- caused by a bug in PCRE or by overwriting of the compiled
- pattern.
-
- PCRE_ERROR_NOMEMORY (-7)
-
- If a pattern contains back references, but the ovector that
- is passed to pcre_exec() is not big enough to remember the
- referenced substrings, PCRE gets a block of memory at the
- start of matching to use for this purpose. If the call via
- pcre_malloc() fails, this error is given. The memory is
- freed at the end of matching.
-
-
-
- INFORMATION ABOUT A PATTERN
- The pcre_info() function returns information about a
- compiled pattern. Its yield is the number of capturing
- subpatterns, or one of the following negative numbers:
-
- PCRE_ERROR_NULL      the argument code was NULL
- PCRE_ERROR_BADMAGIC  the "magic number" was not found
-
- If the optptr argument is not NULL, a copy of the options
- with which the pattern was compiled is placed in the integer
- it points to.
-
- If the firstcharptr argument is not NULL, it is used to pass
- back information about the first character of any matched
- string. If there is a fixed first character, e.g. from a
- pattern such as (cat|cow|coyote), then it is returned in the
- integer pointed to by firstcharptr. Otherwise, if the
- pattern was compiled with the PCRE_MULTILINE option, and
- every branch started with "^", then -1 is returned,
- indicating that the pattern will match at the start of a
- subject string or after any "\n" within the string.
- Otherwise -2 is returned.
-
-
-
- LIMITATIONS
- There are some size limitations in PCRE but it is hoped that
- they will never in practice be relevant. The maximum length
- of a compiled pattern is 65539 (sic) bytes. All values in
- repeating quantifiers must be less than 65536. The maximum
- number of capturing subpatterns is 99. The maximum number
- of all parenthesized subpatterns, including capturing
- subpatterns and assertions, is 200.
-
- The maximum length of a subject string is the largest
- positive number that an integer variable can hold. However,
- PCRE uses recursion to handle subpatterns and indefinite
- repetition. This means that the available stack space may
- limit the size of a subject string that can be processed by
- certain patterns.
-
-
-
- DIFFERENCES FROM PERL
- The differences described here are with respect to Perl
- 5.004.
-
- 1. By default, a whitespace character is any character that
- the C library function isspace() recognizes, though it is
- possible to compile PCRE with alternative character type
- tables. Normally isspace() matches space, formfeed, newline,
- carriage return, horizontal tab, and vertical tab. Perl 5 no
- longer includes vertical tab in its set of whitespace
- characters. The \v escape that was in the Perl documentation
- for a long time was never in fact recognized. However, the
- character itself was treated as whitespace at least up to
- 5.002. In 5.004 it does not match \s.
-
- 2. PCRE does not allow repeat quantifiers on lookahead
- assertions. Perl permits them, but they do not mean what you
- might think. For example, "(?!a){3}" does not assert that
- the next three characters are not "a". It just asserts that
- the next character is not "a" three times.
-
- 3. Capturing subpatterns that occur inside negative
- lookahead assertions are counted, but their entries in the
- offsets vector are never set. Perl sets its numerical
- variables from any such patterns that are matched before the
- assertion fails to match something (thereby succeeding), but
- only if the negative lookahead assertion contains just one
- branch.
-
- 4. Though binary zero characters are supported in the
- subject string, they are not allowed in a pattern string
- because it is passed as a normal C string, terminated by
- zero. The escape sequence "\0" can be used in the pattern to
- represent a binary zero.
-
- 5. The following Perl escape sequences are not supported:
- \l, \u, \L, \U, \E, \Q. In fact these are implemented by
- Perl's general string-handling and are not part of its
- pattern matching engine.
-
- 6. The Perl \G assertion is not supported as it is not
- relevant to single pattern matches.
-
- 7. If a backreference can never be matched, PCRE diagnoses
- an error. In a case like
-
- /(123)\2/
-
- the error occurs at compile time. Perl gives no compile time
- error; version 5.004 either always fails to match, or gives
- a segmentation fault at runtime. In more complicated cases
- such as
-
- /(1)(2)(3)(4)(5)(6)(7)(8)(9)(10\10)/
-
- PCRE returns PCRE_ERROR_BADREF at run time. Perl always
- fails to match.
-
- 8. PCRE provides some extensions to the Perl regular
- expression facilities:
-
- (a) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not
- set, the $ meta-character matches only at the very end of
- the string.
-
- (b) If PCRE_EXTRA is set, the \X assertion (a Prolog-like
- "cut") is recognized, and a backslash followed by a letter
- with no special meaning is faulted. There is also a new kind
- of parenthesized subpattern starting with (?> which has a
- block on backtracking into it once it has matched.
-
- (c) If PCRE_UNGREEDY is set, the greediness of the
- repetition quantifiers is inverted, that is, by default they
- are not greedy, but if followed by a question mark they are.
-
-
-
- REGULAR EXPRESSION DETAILS
- The syntax and semantics of the regular expressions
- supported by PCRE are described below. Regular expressions
- are also described in the Perl documentation and in a number
- of other books, some of which have copious examples. Jeffrey
- Friedl's "Mastering Regular Expressions", published by
- O'Reilly (ISBN 1-56592-257-3), covers them in great detail.
- The description here is intended as reference documentation.
-
- A regular expression is a pattern that is matched against a
- subject string from left to right. Most characters stand for
- themselves in a pattern, and match the corresponding
- characters in the subject. As a trivial example, the pattern
-
- The quick brown fox
-
- matches a portion of a subject string that is identical to
- itself. The power of regular expressions comes from the
- ability to include alternatives and repetitions in the
- pattern. These are encoded in the pattern by the use of
- meta-characters, which do not stand for themselves but
- instead are interpreted in some special way.
-
- There are two different sets of meta-characters: those that
- are recognized anywhere in the pattern except within square
- brackets, and those that are recognized in square brackets.
- Outside square brackets, the meta-characters are as follows:
-
- \      general escape character with several uses
- ^      assert start of subject (or line, in multiline mode)
- $      assert end of subject (or line, in multiline mode)
- .      match any character except newline (by default)
- [      start character class definition
- |      start of alternative branch
- (      start subpattern
- )      end subpattern
- ?      extends the meaning of (
-        also 0 or 1 quantifier
-        also quantifier minimizer
- *      0 or more quantifier
- +      1 or more quantifier
- {      start min/max quantifier
-
- Part of a pattern that is in square brackets is called a
- "character class". In a character class the only meta-
- characters are:
-
- \      general escape character
- ^      negate the class, but only if the first character
- -      indicates character range
- ]      terminates the character class
-
- The following sections describe the use of each of the
- meta-characters.
-
-
-
- BACKSLASH
- The backslash character has several uses. Firstly, if it is
- followed by a non-alphameric character, it takes away any
- special meaning that character may have. This use of
- backslash as an escape character applies both inside and
- outside character classes.
-
- For example, if you want to match a "*" character, you write
- "\*" in the pattern. This applies whether or not the
- following character would otherwise be interpreted as a
- meta-character, so it is always safe to precede a non-
- alphameric with "\" to specify that it stands for itself. In
- particular, if you want to match a backslash, you write
- "\\".
-
- If a pattern is compiled with the PCRE_EXTENDED option,
- whitespace in the pattern and characters between a "#"
- outside a character class and the next newline character are
- ignored. An escaping backslash can be used to include a
- whitespace or "#" character as part of the pattern.
-
- A second use of backslash provides a way of encoding non-
- printing characters in patterns in a visible manner. There
- is no restriction on the appearance of non-printing
- characters, apart from the binary zero that terminates a
- pattern, but when a pattern is being prepared by text
- editing, it is usually easier to use one of the following
- escape sequences than the binary character it represents:
-
- \a     alarm, that is, the BEL character (hex 07)
- \cx    "control-x", where x is any character
- \e     escape (hex 1B)
- \f     formfeed (hex 0C)
- \n     newline (hex 0A)
- \r     carriage return (hex 0D)
- \t     tab (hex 09)
- \xhh   character with hex code hh
- \ddd   character with octal code ddd or backreference
-
- The precise effect of "\cx" is as follows: if "x" is a lower
- case letter, it is converted to upper case. Then bit 6 of
- the character (hex 40) is inverted. Thus "\cz" becomes hex
- 1A, but "\c{" becomes hex 3B, while "\c;" becomes hex 7B.
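-
- The rule for "\cx" amounts to one case fold and one bit
- flip. A minimal sketch, assuming an ASCII character set and
- the default C locale; control_escape_value() is an invented
- name, not a PCRE function.
-
```c
#include <assert.h>
#include <ctype.h>

/* Value produced by the escape "\cx" for a given x, following the
   rule in the text: fold a lower case letter to upper case, then
   invert bit 6 (hex 40) of the character. */
unsigned char control_escape_value(unsigned char x)
{
    /* A binary zero terminates the pattern, so it cannot be x. */
    assert(x != 0);
    if (islower(x))
        x = (unsigned char)toupper(x);
    return (unsigned char)(x ^ 0x40);
}
```
-
- Called with 'z', '{' and ';' this gives hex 1A, 3B and 7B,
- the three values quoted above.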
-
- After "\x", up to two hexadecimal digits are read (letters
- can be in upper or lower case).
-
- After "\0" up to two further octal digits are read. In both
- cases, if there are fewer than two digits, just those that
- are present are used. Thus the sequence "\0\x\07" specifies
- two binary zeros followed by a BEL character. Make sure you
- supply two digits if the character that follows could
- otherwise be taken as another digit.
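-
- The digit-counting rules for "\x" and "\0" can be tried out
- with a small decoder. This is an illustrative sketch, not the
- library's parser: decode_hex_octal() is an invented name, it
- handles only these two escapes, and it copies every other
- character (including other backslash sequences) unchanged.
-
```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Numeric value of one hexadecimal digit (caller checks isxdigit). */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return tolower((unsigned char)c) - 'a' + 10;
}

/* Decode "\x" (up to two hex digits) and "\0" (up to two further
   octal digits) in pattern p, writing bytes to out.
   Returns the number of bytes written, or (size_t)-1 if out is full. */
size_t decode_hex_octal(const char *p, unsigned char *out, size_t outsize)
{
    size_t n = 0;
    assert(p != NULL && out != NULL);
    while (*p != '\0') {
        unsigned value;
        int digits = 0;
        if (p[0] == '\\' && p[1] == 'x') {
            value = 0;
            for (p += 2; digits < 2 && isxdigit((unsigned char)*p); p++, digits++)
                value = value * 16 + (unsigned)hex_value(*p);
        } else if (p[0] == '\\' && p[1] == '0') {
            value = 0;
            for (p += 2; digits < 2 && *p >= '0' && *p <= '7'; p++, digits++)
                value = value * 8 + (unsigned)(*p - '0');
        } else {
            if (p[0] == '\\' && p[1] != '\0') {
                /* Some other escape: keep the backslash as it is ... */
                if (n >= outsize)
                    return (size_t)-1;
                out[n++] = (unsigned char)*p++;
            }
            value = (unsigned char)*p++;   /* ... and the next character */
        }
        if (n >= outsize)
            return (size_t)-1;
        out[n++] = (unsigned char)value;
    }
    return n;
}
```
-
- Given the sequence "\0\x\07" from the text, the decoder
- writes three bytes: two binary zeros and a BEL (hex 07).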
-
- The handling of a backslash followed by a digit other than 0
- is complicated. Outside a character class, PCRE reads it
- and any following digits as a decimal number. If the number
- is less than 10, or if there have been at least that many
- previous capturing left parentheses in the expression, the
- entire sequence is taken as a back reference. A description
- of how this works is given later, following the discussion
- of parenthesized subpatterns.
-
- Inside a character class, or if the decimal number is
- greater than 9 and there have not been that many capturing
- subpatterns, PCRE re-reads up to three octal digits
- following the backslash, and generates a single byte from
- the least significant 8 bits of the value. Any subsequent
- digits stand for themselves. For example:
-
- \040   is another way of writing a space
- \40    is the same, provided there are fewer than 40
-        previous capturing subpatterns
- \7     is always a back reference
- \11    might be a back reference, or another way of
-        writing a tab
- \011   is always a tab
- \0113  is a tab followed by the character "3"
- \113   is the character with octal code 113 (since there
-        can be no more than 99 back references)
- \377   is a byte consisting entirely of 1 bits
- \81    is either a back reference, or a binary zero
-        followed by the two characters "8" and "1"
-
- Note that octal values of 100 or greater must not be
- introduced by a leading zero, because no more than three
- octal digits are ever read.
-
- All the sequences that define a single byte value can be
- used both inside and outside character classes. In addition,
- inside a character class, the sequence "\b" is interpreted
- as the backspace character (hex 08). Outside a character
- class it has a different meaning (see below).
-
- The third use of backslash is for specifying generic
- character types:
-
- \d     any decimal digit
- \D     any character that is not a decimal digit
- \s     any whitespace character
- \S     any character that is not a whitespace character
- \w     any "word" character
- \W     any "non-word" character
-
- Each pair of escape sequences partitions the complete set of
- characters into two disjoint sets. Any given character
- matches one, and only one, of each pair.
-
- A "word" character is any letter or digit or the underscore
- character, that is, any character which can be part of a
- Perl "word". These character type sequences can appear both
- inside and outside character classes. They each match one
- character of the appropriate type. If the current matching
- point is at the end of the subject string, all of them fail,
- since there is no character to match.
-
- The fourth use of backslash is for certain assertions. An
- assertion specifies a condition that has to be met at a
- particular point in a match, without consuming any
- characters from the subject string. The backslashed
- assertions are
-
- \b     word boundary
- \B     not a word boundary
- \A     start of subject (independent of multiline mode)
- \Z     end of subject (independent of multiline mode)
-
- Assertions may not appear in character classes (but note
- that "\b" has a different meaning, namely the backspace
- character, inside a character class).
-
- A word boundary is a position in the subject string where
- the current character and the previous character do not both
- match "\w" or "\W" (i.e. one matches "\w" and the other
- matches "\W"), or the start or end of the string if the
- first or last character matches "\w", respectively. More
- complicated assertions are also supported (see below).
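-
- As a rough illustration of the definition, the following sketch
- lists the offsets at which \b and \B hold in a short subject. It
- uses Python's re module, which is assumed to follow the same rule;
- the subject string is made up for the example.
-
```python
import re

subject = "cat, dog"

# \b and \B are zero-width, so each match is an empty string at some offset.
boundaries = [m.start() for m in re.finditer(r"\b", subject)]
non_boundaries = [m.start() for m in re.finditer(r"\B", subject)]

# Inside a character class, \b is the backspace character (hex 08) instead.
backspace = re.fullmatch(r"[\b]", "\x08")
```
-
- Offsets 0 and 8 are the start and end of the string, which count
- as boundaries here because the first and last characters match \w.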
-
- The "\A" and "\Z" assertions differ from the traditional "^"
- and "$" (described below) in that they only ever match at
- the very start and end of the subject string, respectively,
- whatever options are set.
-
- When the PCRE_EXTRA flag is set on a call to pcre_compile(),
- the additional assertion \X, which has no equivalent in
- Perl, is recognized. This operates like the "cut" operation
- in Prolog: it prevents the matching operation from
- backtracking past it. For example, if the expression
-
- .*/foo
-
- is matched against the string "/this/string/is/not" then
- after the greedy .* has swallowed the whole string, PCRE
- keeps backtracking all the way to the beginning before
- failing. If, on the other hand, the expression is
-
- .*/\Xfoo
-
- then once it has discovered that "/not" is not "/foo",
- backtracking ceases, and the match fails. See also the
- section on "once-only" subpatterns below.
-
-
-
-
- CIRCUMFLEX AND DOLLAR
- Outside a character class, the circumflex character is an
- assertion which is true only if the current matching point
- is at the start of the subject string, in the default
- matching mode. Inside a character class, circumflex has an
- entirely different meaning (see below).
-
- Circumflex need not be the first character of the pattern if
- a number of alternatives are involved, but it should be the
- first thing in each alternative in which it appears if the
- pattern is ever to match that branch. If all possible
- alternatives start with a circumflex, that is, if the
- pattern is constrained to match only at the start of the
- subject, it is said to be an "anchored" pattern. (There are
- also other constructs that can cause a pattern to be
- anchored.)
-
- A dollar character is an assertion which is true only if the
- current matching point is at the end of the subject string,
- or immediately before a newline character that is the last
- character in the string (by default). Dollar need not be the
- last character of the pattern if a number of alternatives
- are involved, but it should be the last item in any branch
- in which it appears. Dollar has no special meaning in a
- character class.
-
- The meaning of dollar can be changed so that it matches only
- at the very end of the string, by setting the
- PCRE_DOLLAR_ENDONLY option at compile or matching time.
-
- The meanings of the circumflex and dollar characters are
- changed if the PCRE_MULTILINE option is set at compile or
- matching time. When this is the case, they match immediately
- after and immediately before an internal "\n" character,
- respectively, in addition to matching at the start and end
- of the subject string. For example, the pattern /^abc$/
- matches the subject string "def\nabc" in multiline mode, but
- not otherwise. Consequently, patterns that are anchored in
- single line mode because all branches start with "^" are not
- anchored in multiline mode. The PCRE_DOLLAR_ENDONLY option
- is ignored if PCRE_MULTILINE is set.
-
- Note that the sequences "\A" and "\Z" can be used to match
- the start and end of the subject in both modes, and if all
- branches of a pattern start with "\A" it is always anchored.
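-
- The behaviour described in this section can be tried out with the
- sketch below. It relies on Python's re module, whose re.MULTILINE
- flag is assumed to correspond to PCRE_MULTILINE and whose \A and \Z
- are assumed to match only at the very start and end of the subject,
- as described here. There is no Python counterpart to
- PCRE_DOLLAR_ENDONLY, so that option is not shown.
-
```python
import re

subject = "def\nabc"

# Default mode: ^ matches only at the very start, so the pattern fails.
default_mode = re.search(r"^abc$", subject)

# Multiline mode: ^ also matches immediately after the internal newline.
multiline = re.search(r"^abc$", subject, re.MULTILINE)

# By default $ also matches just before a newline that ends the subject.
before_final_newline = re.search(r"abc$", "abc\n")

# \A and \Z ignore multiline mode and a trailing newline.
anchored_start = re.search(r"\Aabc", subject, re.MULTILINE)
anchored_end = re.search(r"abc\Z", "abc\n")
```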
-
-
-
- FULL STOP (PERIOD, DOT)
- Outside a character class, a dot in the pattern matches any
- one character in the subject, including a non-printing
- character, but not (by default) newline. If the PCRE_DOTALL
- option is set, then dots match newlines as well. The
- handling of dot is entirely independent of the handling of
- circumflex and dollar, the only relationship being that they
- both involve newline characters. Dot has no special meaning
- in a character class.
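-
- A small sketch of the default and "dotall" behaviour, using
- Python's re module on the assumption that re.DOTALL corresponds to
- PCRE_DOTALL:
-
```python
import re

# A dot matches a non-printing character such as NUL ...
non_printing = re.fullmatch(r"a.c", "a\x00c")

# ... but not a newline, unless the "dotall" behaviour is switched on.
newline_default = re.fullmatch(r"a.c", "a\nc")
newline_dotall = re.fullmatch(r"a.c", "a\nc", re.DOTALL)

# Inside a character class the dot is an ordinary character.
literal_dot = [c for c in "a.\n" if re.fullmatch(r"[.]", c)]
```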
-
-
-
- SQUARE BRACKETS
- An opening square bracket introduces a character class,
- terminated by a closing square bracket. A closing square
- bracket on its own is not special. If a closing square
- bracket is required as a member of the class, it should be
- the first data character in the class (after an initial
- circumflex, if present) or escaped with \.
-
- A character class matches a single character in the subject;
- the character must be in the set of characters defined by
- the class, unless the first character in the class is a
- circumflex, in which case the subject character must not be
- in the set defined by the class. If a circumflex is actually
- required as a member of the class, ensure it is not the
- first character, or escape it with \.
-
- For example, the character class [aeiou] matches any lower
- case vowel, while [^aeiou] matches any character that is not
- a lower case vowel. Note that a circumflex is just a
- convenient notation for specifying the characters which are
- in the class by enumerating those that are not. It is not an
- assertion: it still consumes a character from the subject
- string, and fails if the current pointer is at the end of
- the string.
-
- The newline character is never treated in any special way in
- character classes, whatever the setting of the PCRE_DOTALL
- or PCRE_MULTILINE options is. A class such as [^a] will
- always match a newline.
-
- The minus (hyphen) character can be used to specify a range
- of characters in a character class. For example, [d-m]
- matches any letter between d and m, inclusive. If a minus
- character is required in a class, it must be escaped with \
- or appear in a position where it cannot be interpreted as
- indicating a range, typically as the first or last character
- in the class. It is not possible to have the character "]"
- as the end character of a range, since a sequence such as
- [w-] is interpreted as a class of two characters. The octal
- or hexadecimal representation of "]" can, however, be used
- to end a range.
-
- Ranges operate in ASCII collating sequence. They can also be
- used for characters specified numerically, for example
- [\000-\037]. If a range such as [W-c] is used when
- PCRE_CASELESS is set, it matches the letters involved in
- either case.
-
- The character types \d, \D, \s, \S, \w, and \W may also
- appear in a character class, and add the characters that
- they match to the class. For example, the class [^\W_]
- matches any letter or digit.
-
- All non-alphameric characters other than \, -, ^ (at the
- start) and the terminating ] are non-special in character
- classes, but it does no harm if they are escaped.
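-
- The class rules above can be exercised with the following sketch.
- It uses Python's re module, which is assumed to agree with PCRE on
- these points (re.IGNORECASE standing in for PCRE_CASELESS); the
- variable names are illustrative only.
-
```python
import re

vowel = re.fullmatch(r"[aeiou]", "e")

# A negated class still consumes a character, and newline is not special.
negated_newline = re.fullmatch(r"[^aeiou]", "\n")
negated_at_end = re.compile(r"[^aeiou]").match("x", 1)  # no character left

# "]" as the first data character, and "-" as the last, are literals.
literal_bracket = re.fullmatch(r"[]x]", "]")
literal_hyphen = re.fullmatch(r"[a-]", "-")

# Ranges, including one given by octal escapes.
in_range = re.findall(r"[d-m]", "abcdefmnop")
control_char = re.fullmatch(r"[\000-\037]", "\t")

# With caseless matching, [W-c] matches the letters involved in either case.
caseless = [c for c in "wxCdD" if re.fullmatch(r"[W-c]", c, re.IGNORECASE)]

# Character types inside a class: [^\W_] is "any letter or digit".
letters_digits = re.findall(r"[^\W_]", "a_1-", re.ASCII)
```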
-
-
-
- VERTICAL BAR
- Vertical bar characters are used to separate alternative
- patterns. The matching process tries all the alternatives in
- turn. For example, the pattern
-
- gilbert|sullivan
-
- matches either "gilbert" or "sullivan". Any number of
- alternatives can be used, and an empty alternative is
- permitted (matching the empty string).
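-
- A brief sketch of alternation, using Python's re module as an
- assumed equivalent. The "a|ab" case is an extra illustration, not
- taken from the text, of alternatives being tried in turn from left
- to right.
-
```python
import re

found = [m.group(0) for m in re.finditer(r"gilbert|sullivan", "gilbert and sullivan")]

# An empty alternative matches the empty string.
empty_alternative = re.fullmatch(r"gilbert|", "")

# The first alternative that allows a match is the one used.
first_wins = re.match(r"a|ab", "ab").group(0)
```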
-
-
-
- SUBPATTERNS
- Subpatterns are delimited by parentheses (round brackets),
- which can be nested. Marking part of a pattern as a
- subpattern does two things:
-
- 1. It localizes a set of alternatives. For example, the
- pattern
-
- cat(aract|erpillar|)
-
- matches one of the words "cat", "cataract", or
- "caterpillar". Without the parentheses, it would match
- "cataract", "erpillar" or the empty string.
-
- 2. It sets up the subpattern as a capturing subpattern (as
- defined above). When the whole pattern matches, that
- portion of the subject string that matched the subpattern is
- passed back to the caller via the ovector argument of
- pcre_exec(). Opening parentheses are counted from left to
- right (starting from 1) to obtain the numbers of the
- capturing subpatterns.
-
- For example, if the string "the red king" is matched against
- the pattern
-
- the ((red|white) (king|queen))
-
- the captured substrings are "red king", "red", and "king",
- and are numbered 1, 2, and 3.
-
- The fact that plain parentheses fulfil two functions is not
- always helpful. There are often times when a grouping
- subpattern is required without a capturing requirement. If
- an opening parenthesis is followed by "?:", the subpattern
- does not do any capturing, and is not counted when computing
- the number of any subsequent capturing subpatterns. For
- example, if the string "the white queen" is matched against
- the pattern
-
- the ((?:red|white) (king|queen))
-
- the captured substrings are "white queen" and "queen", and
- are numbered 1 and 2. The maximum number of captured
- substrings is 99, and the maximum number of all subpatterns,
- both capturing and non-capturing, is 200.
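-
- The grouping and numbering rules can be seen in the sketch below.
- It uses Python's re module rather than pcre_exec() and its ovector
- argument, on the assumption that groups are numbered the same way;
- groups() returns the captured substrings in order, starting from
- number 1.
-
```python
import re

# Parentheses localize the alternatives: only the "cat..." words match.
words = ("cat", "cataract", "caterpillar", "erpillar")
cat_words = [w for w in words if re.fullmatch(r"cat(aract|erpillar|)", w)]

# Capturing subpatterns are numbered by their opening parentheses.
capturing = re.search(r"the ((red|white) (king|queen))", "the red king").groups()

# "?:" makes a group non-capturing, so it is skipped in the numbering.
non_capturing = re.search(
    r"the ((?:red|white) (king|queen))", "the white queen"
).groups()
```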
-
-
-
- BACK REFERENCES
- Outside a character class, a backslash followed by a digit
- greater than 0 (and possibly further digits) is a back
- reference to a capturing subpattern earlier (i.e. to its
- left) in the pattern, provided there have been that many
- previous capturing left parentheses. However, if the decimal
- number following the backslash is less than 10, it is always
- taken as a back reference, and causes an error if there have
- not been that many previous capturing left parentheses. See
- the section entitled "Backslash" above for further details
- of the handling of digits following a backslash.
-
- A back reference matches whatever actually matched the
- capturing subpattern in the current subject string, rather
- than anything matching the subpattern itself. So the pattern
-
- (sens|respons)e and \1ibility
-
- matches "sense and sensibility" and "response and
- responsibility", but not "sense and responsibility".
-
- There may be more than one back reference to the same
- subpattern. If a subpattern has not actually been used in a
- particular match, then any back references to it always
- fail. For example, the pattern
-
- (a|(bc))\2
-
- always fails if it starts to match "a" rather than "bc".
- Because there may be up to 99 back references, all digits
- following the backslash are taken as part of a potential
- back reference number. If the pattern continues with a digit
- character, then some delimiter must be used to terminate the
- back reference. If the PCRE_EXTENDED option is set, this can
- be whitespace. Otherwise an empty comment can be used.
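-
- The following sketch runs the examples from this section through
- Python's re module, which is assumed to handle back references to
- set and unset groups in the same way. The last pattern is a
- made-up case showing an empty comment used to end a back reference
- before a literal digit.
-
```python
import re

pattern = re.compile(r"(sens|respons)e and \1ibility")
subjects = (
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
)
results = [bool(pattern.fullmatch(s)) for s in subjects]

# Group 2 is unset when the first alternative "a" is taken, so \2 fails.
unset_group = re.search(r"(a|(bc))\2", "aa")
set_group = re.search(r"(a|(bc))\2", "bcbc")

# An empty comment separates \1 from the literal digit "1" that follows.
delimited = re.fullmatch(r"(a)\1(?#)1", "aa1")
```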
-
-
-
- REPETITION
- Repetition is specified by quantifiers, which can follow any
- of the following items:
-
- a single character, possibly escaped
- the . metacharacter
- a character class
- a back reference
- a parenthesized subpattern
-
- The general repetition quantifier specifies a minimum and
- maximum number of permitted matches, by giving the two
- numbers in curly brackets (braces), separated by a comma.
- The numbers must be less than 65536, and the first must be
- less than or equal to the second. For example:
-
- z{2,4}
-
- matches "zz", "zzz", or "zzzz". A closing brace on its own
- is not a special character. If the second number is omitted,
- but the comma is present, there is no upper limit; if the
- second number and the comma are both omitted, the quantifier
- specifies an exact number of required matches. Thus
-
- [aeiou]{3,}
-
- matches at least 3 successive vowels, but may match many
- more, while
-
- \d{8}
-
- matches exactly 8 digits. An opening curly bracket that
- appears in a position where a quantifier is not allowed, or
- one that does not match the syntax of a quantifier, is taken
- as a literal character. For example, "{,6}" is not a
- quantifier, but a literal string of four characters.
-
- The quantifier {0} is permitted, causing the expression to
- behave as if the previous item and the quantifier were not
- present.
-
- For convenience (and historical compatibility) the three
- most common quantifiers have single-character abbreviations:
-
- * is equivalent to {0,}
- + is equivalent to {1,}
- ? is equivalent to {0,1}
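-
- The quantifier forms above can be tried with the sketch below,
- which uses Python's re module as an assumed equivalent. The "{,6}"
- case is left out on purpose, because Python reads that form as a
- quantifier, unlike the behaviour described here. The helper name
- "same" is illustrative only.
-
```python
import re

# z{2,4}: between two and four repetitions of "z".
candidates = ("z", "zz", "zzz", "zzzz", "zzzzz")
z_hits = [s for s in candidates if re.fullmatch(r"z{2,4}", s)]

# {3,}: no upper limit; the greedy match takes the whole run of vowels.
vowel_run = re.search(r"[aeiou]{3,}", "queueing").group(0)

# {8}: exactly eight digits.
eight = bool(re.fullmatch(r"\d{8}", "19981210"))
seven = bool(re.fullmatch(r"\d{8}", "1998121"))

# {0}: the quantified item behaves as if it were absent.
zero = bool(re.fullmatch(r"ab{0}c", "ac"))


def same(short, long_form, subjects=("", "a", "aaa")):
    """Check that two patterns accept exactly the same test subjects."""
    return all(
        bool(re.fullmatch(short, s)) == bool(re.fullmatch(long_form, s))
        for s in subjects
    )


abbreviations_agree = (
    same(r"a*", r"a{0,}") and same(r"a+", r"a{1,}") and same(r"a?", r"a{0,1}")
)
```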
-
- By default, the quantifiers are "greedy", that is, they
- match as much as possible (up to the maximum number of
- permitted times), without causing the rest of the pattern to
- fail. The classic example of where this gives problems is in
- trying to match comments in C programs. These appear between
- the sequences /* and */ and within the sequence, individual
- * and / characters may appear. An attempt to match C
- comments by applying the pattern
-
- /\*.*\*/
-
- to the string
-
- /* first comment */ not comment /* second comment */
-
- fails, because it matches the entire string due to the
- greediness of the .* item.
-
- However, if a quantifier is followed by a question mark,
- then it ceases to be greedy, and instead matches the minimum
- number of times possible, so the pattern
-
- /\*.*?\*/
-
- does the right thing with the C comments. The meaning of the
- various quantifiers is not otherwise changed, just the
- preferred number of matches. Do not confuse this use of
- question mark with its use as a quantifier in its own right.
- Because it has two uses, it can sometimes appear doubled, as
- in
-
- \d??\d
-
- which matches one digit by preference, but can match two if
- that is the only way the rest of the pattern matches.
-
- If the PCRE_UNGREEDY option is set (an option which is not
- available in Perl) then the quantifiers are not greedy by
- default, but individual ones can be made greedy by following
- them with a question mark. In other words, it inverts the
- default behaviour.
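-
- The difference between greedy and minimizing quantifiers can be
- seen in the sketch below, which uses Python's re module as an
- assumed equivalent. Python has no counterpart to PCRE_UNGREEDY, so
- only the explicit "?" form is shown.
-
```python
import re

text = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the last "*/", so one match covers the whole string.
greedy = re.findall(r"/\*.*\*/", text)

# Minimizing: .*? stops at the first "*/", giving one match per comment.
minimal = re.findall(r"/\*.*?\*/", text)

# \d??\d prefers one digit, but takes two when the rest requires it.
one_digit = re.match(r"\d??\d", "12").group(0)
two_digits = re.fullmatch(r"\d??\d", "12").group(0)
```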
-
- When a parenthesized subpattern is quantified with a minimum
- repeat count that is greater than 1 or with a limited
- maximum, more store is required for the compiled pattern, in
- proportion to the size of the minimum or maximum.
-
- If a pattern starts with .* then it is implicitly anchored,
- since whatever follows will be tried against every character
- position in the subject string. PCRE treats this as though
- it were preceded by \A.
-
- When a capturing subpattern is repeated, the value captured
- is the substring that matched the final iteration. For
- example,
-
- (\s*tweedle[dume]{3})+\1
-
- matches "tweedledum tweedledee tweedledee" but not
- "tweedledum tweedledee tweedledum".
-
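- To see that a repeated capturing subpattern keeps only its final
- iteration, the sketch below uses a pattern in which the group also
- accepts optional leading whitespace, (\s*tweedle[dume]{3})+\1, so
- that it can step across the spaces in the subject. Python's re
- module is used as an assumed equivalent.
-
```python
import re

pattern = re.compile(r"(\s*tweedle[dume]{3})+\1")

same_tail = pattern.search("tweedledum tweedledee tweedledee")
different_tail = pattern.search("tweedledum tweedledee tweedledum")

# After backtracking, the group holds the last iteration that was kept,
# including its leading space, and \1 must repeat exactly that text.
captured = same_tail.group(1)
```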
-
-
- ASSERTIONS
- An assertion is a test on the characters following the
- current matching point that does not actually consume any of
- those characters. The simple assertions coded as \b, \B, \A,
- \Z, ^ and $ are described above. More complicated assertions
- are coded as subpatterns starting with (?= for positive
- assertions, and (?! for negative assertions. For example,
-
- \w+(?=;)
-
- matches a word followed by a semicolon, but does not include
- the semicolon in the match, and
-
- foo(?!bar)
-
- matches any occurrence of "foo" that is not followed by
- "bar". Note that the apparently similar pattern
-
- (?!foo)bar
-
- does not find an occurrence of "bar" that is preceded by
- something other than "foo"; it finds any occurrence of "bar"
- whatsoever, because the assertion (?!foo) is always true
- when the next three characters are "bar".
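-
- The three patterns above can be compared with the sketch below,
- which uses Python's re module as an assumed equivalent. The subject
- strings are made up for the example.
-
```python
import re

# Positive assertion: the semicolon is required but not part of the match.
word = re.search(r"\w+(?=;)", "total;").group(0)

# Negative assertion: only the "foo"s not followed by "bar" are found.
foo_offsets = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz foo")]

# (?!foo)bar looks ahead, not behind, so "bar" is found even after "foo".
bar_offset = re.search(r"(?!foo)bar", "foobar").start()
```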
-
- Assertion subpatterns are not capturing subpatterns, and may
- not be repeated, because it makes no sense to assert the
- same thing several times. If an assertion contains capturing
- subpatterns within it, these are always counted for the
- purposes of numbering the capturing subpatterns in the whole
- pattern. Substring capturing is carried out for positive
- assertions, but it does not make sense for negative
- assertions.
-
- Assertions count towards the maximum of 200 parenthesized
- subpatterns.
-
-
-
- ONCE-ONLY SUBPATTERNS
- The facility described in this section is available only
- when the PCRE_EXTRA option is set at compile time. It is an
- extension to Perl regular expressions.
-
- With both maximizing and minimizing repetition, failure of
- what follows normally causes the repeated item to be re-
- evaluated to see if a different number of repeats allows the
- rest of the pattern to match. Sometimes it is useful to
- prevent this, either to change the nature of the match, or
- to cause it to fail earlier than it otherwise might when the
- author of the pattern knows there is no point in carrying
- on.
-
- Consider, for example, the pattern \d+foo when applied to
- the subject line
-
- 123456bar
-
- After matching all 6 digits and then failing to match "foo",
- the normal action of the matcher is to try again with only 5
- digits matching the \d+ item, and then with 4, and so on,
- before ultimately failing. Once-only subpatterns provide the
- means for specifying that once a portion of the pattern has
- matched, it is not to be re-evaluated in this way, so the
- matcher would give up immediately on failing to match "foo"
- the first time. The notation is another kind of special
- parenthesis, starting with (?> as in this example:
-
- (?>\d+)bar
-
- This kind of parenthesis "locks up" the part of the pattern
- it contains once it has matched, and a failure further into
- the pattern is prevented from backtracking into it.
- Backtracking past it to previous items, however, works as
- normal.
-
- For simple cases such as the above example, this feature can
- be thought of as a maximizing repeat that must swallow
- everything it can. So, while both \d+ and \d+? are prepared
- to adjust the number of digits they match in order to make
- the rest of the pattern match, (?>\d+) can only match an
- entire sequence of digits.
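-
- The effect of "locking up" a repeat can be sketched in Python.
- Python versions from 3.11 accept the (?> notation directly; to stay
- runnable on older versions, the sketch below swaps in a different
- technique, a lookahead that captures the digits followed by a back
- reference to them, which is assumed to behave like (?>\d+) for this
- simple case. The subjects are made up for the example.
-
```python
import re

subject = "12345"

# Ordinary repeat: \d+ gives back the final "5" so that the literal 5 matches.
backtracking = re.search(r"\d+5", subject)

# Emulated once-only group: the lookahead captures every digit it can and is
# never re-entered, so nothing is left over for the literal 5.
once_only = re.search(r"(?=(\d+))(?:\1)5", subject)

# When the rest of the pattern does follow the digits, both forms match.
plain = re.search(r"\d+bar", "123456bar").group(0)
locked = re.search(r"(?=(\d+))(?:\1)bar", "123456bar").group(0)
```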
-
- This construction can of course contain arbitrarily
- complicated subpatterns, and it can be nested. Contrast with
- the \X assertion, which is a Prolog-like "cut".
-
-
-
- COMMENTS
- The sequence (?# marks the start of a comment which
- continues up to the next closing parenthesis. Nested
- parentheses are not permitted. The characters that make up a
- comment play no part in the pattern matching at all.
-
- If the PCRE_EXTENDED option is set, an unescaped # character
- outside a character class introduces a comment that
- continues up to the next newline character in the pattern.
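-
- Both comment forms can be tried with the sketch below. It uses
- Python's re module, on the assumption that re.VERBOSE corresponds
- to PCRE_EXTENDED; the date-like pattern is made up for the example.
-
```python
import re

# (?# ... ) is skipped entirely, whatever options are set.
inline = re.fullmatch(r"ab(?# this text is ignored)c", "abc")

# In extended mode, whitespace is ignored and # starts a comment that
# runs to the end of the pattern line.
extended = re.compile(
    r"""
    \d{4}   # year
    -       # separator
    \d{2}   # month
    """,
    re.VERBOSE,
)
date = extended.fullmatch("1998-12")

# An escaped #, or one inside a character class, is still a literal.
escaped_hash = re.fullmatch(r"a\#b", "a#b", re.VERBOSE)
class_hash = re.fullmatch(r"a[#]b", "a#b", re.VERBOSE)
```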
-
-
-
- INTERNAL FLAG SETTING
- If the sequence (?i) occurs anywhere in a pattern, it has
- the effect of setting the PCRE_CASELESS option, that is, all
- letters are matched in a case-independent manner. The option
- applies to the whole pattern, not just to the portion that
- follows it.
-
- If the sequence (?m) occurs anywhere in a pattern, it has
- the effect of setting the PCRE_MULTILINE option, that is,
- subject strings matched by this pattern are treated as
- consisting of multiple lines.
-
- If the sequence (?s) occurs anywhere in a pattern, it has
- the effect of setting the PCRE_DOTALL option, so that dot
- metacharacters match newlines as well as all other
- characters.
-
- If the sequence (?x) occurs anywhere in a pattern, it has
- the effect of setting the PCRE_EXTENDED option, that is,
- whitespace is ignored and # introduces a comment that lasts
- till the next newline. The option applies to the whole
- pattern, not just to the portion that follows it.
-
- If the sequence (?U) occurs anywhere in a pattern, it has
- the effect of setting the PCRE_UNGREEDY option which inverts
- the greediness of quantifiers. This is an extension to
- Perl's facilities.
-
- If the sequence (?X) occurs in a pattern, it has the effect
- of setting the PCRE_EXTRA flag, which turns on some
- additional features not found in Perl. This flag setting is
- special in that it must occur earlier in the pattern than
- any of the additional features. It is best put at the start.
-
- If more than one option is required, they can be specified
- jointly, for example as (?ix) or (?mi).
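-
- A sketch of the internal settings, using Python's re module as an
- assumed equivalent for (?i), (?m), (?s) and (?x). The settings are
- placed at the start of each pattern, which recent Python versions
- require. (?U) and (?X) are not shown because Python has no
- counterpart to them.
-
```python
import re

caseless = re.fullmatch(r"(?i)pcre", "PcRe")
multiline = re.search(r"(?m)^abc$", "def\nabc")
dotall = re.fullmatch(r"(?s)a.c", "a\nc")
extended = re.fullmatch(r"(?x) a b c  # spaces and this comment are ignored", "abc")

# Several settings can be given jointly.
combined = re.fullmatch(r"(?ix) A B", "ab")
```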
-
-
-
- PERFORMANCE
- Certain items that may appear in patterns are more efficient
- than others. It is more efficient to use a character class
- like [aeiou] than a set of alternatives such as (a|e|i|o|u).
- In general, the simplest construction that provides the
- required behaviour is usually the most efficient. Jeffrey
- Friedl's book contains a lot of discussion about optimizing
- regular expressions for efficient performance.
-
- The use of PCRE_MULTILINE causes additional processing and
- should be avoided when it is not necessary. Caseless
- matching of character classes is more efficient if
- PCRE_CASELESS is set when the pattern is compiled.
-
-
-
- AUTHOR
- Philip Hazel <ph10@cam.ac.uk>
- University Computing Service,
- New Museums Site,
- Cambridge CB2 3QG, England.
- Phone: +44 1223 334714
-
- Copyright (c) 1998 University of Cambridge.